org_df <- read_excel("wuhan_blood_sample_data_Jan_Feb_2020.xlsx")
df <- org_df %>%
mutate(gender = as.factor(ifelse(gender==1, "male", "female"))) %>%
mutate(outcome = as.factor(ifelse(outcome == 0, "Survived", "Died"))) %>%
filter(!is.na(RE_DATE)) %>%
rename(admission_time = 'Admission time',
discharge_time = 'Discharge time',
hs_CRP = 'High sensitivity C-reactive protein')
names(df)[34] <- "Tumor necrosis factor alpha"
names(df)[37] <- "Interleukin 1 beta"
names(df)[68] <- "Gamma glutamyl transpeptidase"
The dataset consists of 81 variables and has 6106 observations (blood tests). See the summary below:
summary_df <- df %>% select(outcome, gender)
tbl_summary(
summary_df,
by = outcome,
label = gender ~ "Gender") %>%
modify_header(label ~ "**Variable**") %>%
add_overall() %>%
as_kable() %>% kable_paper("hover")
| Variable | Overall, N = 6,106 | Died, N = 2,897 | Survived, N = 3,209 |
|---|---|---|---|
| Gender | | | |
| female | 2,388 (39%) | 749 (26%) | 1,639 (51%) |
| male | 3,718 (61%) | 2,148 (74%) | 1,570 (49%) |
The blood tests were taken from 361 different patients.
df %>% select(PATIENT_ID, gender, outcome) %>%
drop_na(PATIENT_ID) %>%
select(-PATIENT_ID) %>%
tbl_summary(label = gender ~ "Gender", by = outcome) %>%
add_overall() %>%
modify_header(label ~ "**Variable**") %>%
as_kable() %>% kable_paper("hover")
| Variable | Overall, N = 361 | Died, N = 166 | Survived, N = 195 |
|---|---|---|---|
| Gender | | | |
| female | 149 (41%) | 46 (28%) | 103 (53%) |
| male | 212 (59%) | 120 (72%) | 92 (47%) |
From the cleaned dataset, two datasets are created, Patients and Blood tests, each containing only the values it needs, to make the data analysis easier.
One additional column stores the hospitalization time of each patient; it is used in further analysis to check the relation between length of stay and outcome. Go to section Patients Visualization to see basic visualizations about patients.
patients <- df %>%
select(PATIENT_ID, age, gender, admission_time, discharge_time, outcome) %>%
drop_na(PATIENT_ID) %>%
mutate("hospitalization_length" = round((difftime(discharge_time, admission_time, units = "days") ), digits = 2)) %>%
relocate(hospitalization_length, .after = discharge_time)
head(patients) %>%
kbl() %>%
kable_paper("hover")
| PATIENT_ID | age | gender | admission_time | discharge_time | hospitalization_length | outcome |
|---|---|---|---|---|---|---|
| 1 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 17.60 days | Survived |
| 2 | 61 | male | 2020-02-04 21:39:03 | 2020-02-19 12:59:01 | 14.64 days | Survived |
| 3 | 70 | female | 2020-01-23 10:59:36 | 2020-02-08 17:52:31 | 16.29 days | Survived |
| 4 | 74 | male | 2020-01-31 23:03:59 | 2020-02-18 12:59:12 | 17.58 days | Survived |
| 5 | 29 | female | 2020-02-01 20:59:54 | 2020-02-18 10:33:06 | 16.56 days | Survived |
| 6 | 81 | female | 2020-01-24 10:47:10 | 2020-02-07 09:06:58 | 13.93 days | Survived |
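As a quick check of how `hospitalization_length` is computed, the sketch below applies the same `difftime()` call to the first patient shown above. The two timestamps are copied from the table; `tz = "UTC"` is an assumption made only to keep the example reproducible.

```r
# Admission and discharge of patient 1, copied from the table above.
# tz = "UTC" is an assumption for reproducibility, not taken from the data.
admission <- as.POSIXct("2020-01-30 22:12:47", tz = "UTC")
discharge <- as.POSIXct("2020-02-17 12:40:09", tz = "UTC")

# Same computation as in the mutate() call: difference in days, rounded to 2 digits
stay <- round(difftime(discharge, admission, units = "days"), digits = 2)

# Plain number of days, convenient for plotting or arithmetic
stay_days <- as.numeric(stay)
print(stay)
```

The result is a `difftime` object, which is why the table prints the unit next to each value; `as.numeric()` drops the unit when a plain number is needed.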
blood_tests_df <- df %>%
select(-c(admission_time, discharge_time)) %>%
fill(PATIENT_ID)
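The combination of `drop_na(PATIENT_ID)` for patients and `fill(PATIENT_ID)` for blood tests suggests that the patient id is present only on the first row of each patient's block of tests. A minimal sketch of both steps on made-up data (the `toy` table and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical layout: the id appears only on the first test of each patient
toy <- tibble(
  PATIENT_ID = c(1, NA, NA, 2, NA),
  albumin    = c(30, 31, 28, 36, 35)
)

# One row per patient: keep only the rows that carry an id
toy_patients <- toy %>% drop_na(PATIENT_ID)

# All tests, with the id copied downwards into the empty cells
toy_tests <- toy %>% fill(PATIENT_ID)

print(toy_patients)
print(toy_tests)
```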
markers_df <- blood_tests_df %>% select (-c(PATIENT_ID, age, RE_DATE, gender))
tbl_summary(
markers_df,
by = outcome,
missing = "no") %>%
modify_header(label = "**Marker**") %>%
add_n() %>%
bold_labels() %>%
as_kable() %>%
kable_paper("hover") %>%
scroll_box(width = "100%", height = "200px")
| Marker | N | Died, N = 2,897 | Survived, N = 3,209 |
|---|---|---|---|
| Hypersensitive cardiac troponinI | 507 | 70 (18, 631) | 3 (2, 7) |
| hemoglobin | 975 | 123 (110, 135) | 127 (116, 138) |
| Serum chloride | 975 | 104 (100, 111) | 101 (99, 103) |
| Prothrombin time | 662 | 16.3 (15.0, 18.2) | 13.6 (13.1, 14.1) |
| procalcitonin | 459 | 0.38 (0.14, 1.13) | 0.04 (0.02, 0.06) |
| eosinophils(%) | 957 | 0.00 (0.00, 0.10) | 0.70 (0.00, 1.80) |
| Interleukin 2 receptor | 268 | 1,180 (807, 1,603) | 529 (400, 742) |
| Alkaline phosphatase | 930 | 83 (64, 123) | 60 (50, 75) |
| albumin | 934 | 28 (24, 31) | 36 (34, 39) |
| basophil(%) | 957 | 0.10 (0.10, 0.20) | 0.20 (0.10, 0.40) |
| Interleukin 10 | 267 | 11 (6, 17) | 5 (5, 8) |
| Total bilirubin | 930 | 14 (10, 25) | 8 (6, 12) |
| Platelet count | 957 | 112 (55, 174) | 229 (176, 290) |
| monocytes(%) | 958 | 3.0 (2.0, 4.7) | 8.2 (6.3, 10.0) |
| antithrombin | 330 | 80 (70, 92) | 93 (86, 103) |
| Interleukin 8 | 268 | 30 (18, 61) | 11 (7, 19) |
| indirect bilirubin | 906 | 6.2 (4.2, 9.2) | 4.9 (3.4, 7.1) |
| Red blood cell distribution width | 923 | 13.20 (12.40, 14.40) | 12.20 (11.80, 12.80) |
| neutrophils(%) | 957 | 92 (88, 95) | 66 (56, 76) |
| total protein | 931 | 62 (57, 68) | 68 (65, 72) |
| Quantification of Treponema pallidum antibodies | 279 | 0.06 (0.04, 0.07) | 0.05 (0.04, 0.07) |
| Prothrombin activity | 659 | 66 (56, 78) | 94 (88, 103) |
| HBsAg | 279 | 0.01 (0.00, 0.02) | 0.00 (0.00, 0.01) |
| mean corpuscular volume | 957 | 91.3 (87.1, 96.4) | 89.8 (86.8, 91.9) |
| hematocrit | 957 | 35.9 (32.5, 39.8) | 37.1 (34.3, 39.9) |
| White blood cell count | 1,127 | 12 (8, 17) | 6 (4, 8) |
| Tumor necrosis factor alpha | 268 | 11 (8, 17) | 8 (6, 10) |
| mean corpuscular hemoglobin concentration | 957 | 342 (331, 350) | 343 (335, 350) |
| fibrinogen | 566 | 3.92 (2.44, 5.63) | 4.40 (3.56, 5.34) |
| Interleukin 1 beta | 268 | 5.0 (5.0, 5.0) | 5.0 (5.0, 5.0) |
| Urea | 936 | 11 (7, 17) | 4 (3, 5) |
| lymphocyte count | 957 | 0.46 (0.31, 0.69) | 1.25 (0.87, 1.62) |
| PH value | 384 | 6.50 (6.00, 7.41) | 6.50 (6.00, 7.00) |
| Red blood cell count | 1,127 | 4.0 (3.6, 4.6) | 4.2 (3.8, 4.7) |
| Eosinophil count | 957 | 0.00 (0.00, 0.01) | 0.03 (0.00, 0.09) |
| Corrected calcium | 914 | 2.35 (2.27, 2.44) | 2.37 (2.27, 2.44) |
| Serum potassium | 980 | 4.60 (4.04, 5.27) | 4.28 (3.92, 4.62) |
| glucose | 775 | 9.1 (6.9, 13.3) | 5.7 (5.0, 7.6) |
| neutrophils count | 957 | 10.8 (7.0, 15.2) | 3.5 (2.4, 5.2) |
| Direct bilirubin | 930 | 8 (5, 14) | 4 (2, 5) |
| Mean platelet volume | 862 | 11.30 (10.70, 12.20) | 10.40 (9.90, 11.00) |
| ferritin | 283 | 1,636 (928, 2,517) | 504 (235, 834) |
| RBC distribution width SD | 923 | 43.7 (39.9, 48.5) | 39.5 (37.6, 41.4) |
| Thrombin time | 566 | 17.30 (15.80, 19.75) | 16.40 (15.60, 17.30) |
| (%)lymphocyte | 958 | 4 (2, 7) | 24 (16, 33) |
| HCV antibody quantification | 279 | 0.07 (0.04, 0.11) | 0.06 (0.04, 0.08) |
| D-D dimer | 630 | 19 (3, 21) | 1 (0, 1) |
| Total cholesterol | 931 | 3.32 (2.72, 3.88) | 3.93 (3.39, 4.48) |
| aspartate aminotransferase | 935 | 38 (25, 59) | 21 (17, 29) |
| Uric acid | 934 | 245 (166, 374) | 240 (193, 304) |
| HCO3- | 934 | 21.8 (18.8, 24.7) | 24.7 (22.8, 26.7) |
| calcium | 979 | 2.00 (1.90, 2.08) | 2.17 (2.10, 2.25) |
| Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | 475 | 1,467 (516, 4,578) | 64 (23, 166) |
| Lactate dehydrogenase | 934 | 593 (431, 840) | 220 (189, 278) |
| platelet large cell ratio | 862 | 35 (30, 42) | 28 (23, 33) |
| Interleukin 6 | 272 | 66 (30, 142) | 8 (2, 21) |
| Fibrin degradation products | 330 | 114 (18, 150) | 4 (4, 4) |
| monocytes count | 957 | 0.36 (0.20, 0.58) | 0.43 (0.32, 0.58) |
| PLT distribution width | 862 | 13.60 (12.10, 15.93) | 11.70 (10.70, 13.00) |
| globulin | 930 | 34.1 (30.2, 38.2) | 31.8 (29.5, 35.2) |
| Gamma glutamyl transpeptidase | 930 | 42 (27, 79) | 29 (19, 46) |
| International standard ratio | 659 | 1.31 (1.17, 1.48) | 1.04 (0.99, 1.09) |
| basophil count(#) | 957 | 0.010 (0.010, 0.030) | 0.010 (0.010, 0.020) |
| 2019-nCoV nucleic acid detection | 501 | | |
| -1 | | 57 (100%) | 444 (100%) |
| mean corpuscular hemoglobin | 957 | 31.20 (29.90, 32.70) | 30.70 (29.60, 31.90) |
| Activation of partial thromboplastin time | 568 | 40 (36, 45) | 39 (35, 43) |
| hs_CRP | 737 | 114 (65, 191) | 7 (2, 35) |
| HIV antibody quantification | 278 | 0.08 (0.07, 0.11) | 0.09 (0.08, 0.11) |
| serum sodium | 975 | 142 (138, 148) | 140 (138, 141) |
| thrombocytocrit | 862 | 0.15 (0.10, 0.21) | 0.24 (0.19, 0.30) |
| ESR | 383 | 36 (16, 59) | 26 (13, 40) |
| glutamic-pyruvic transaminase | 931 | 26 (18, 44) | 21 (15, 36) |
| eGFR | 936 | 72 (43, 91) | 100 (85, 114) |
| creatinine | 936 | 88 (68, 130) | 64 (54, 83) |
The blood tests are prepared for further analysis. For each patient there were many blood samples, containing many missing values. All the samples of a patient have been combined into one row that keeps, for each marker, the last non-missing value (the one closest to discharge).
last_sample_df <- blood_tests_df %>%
select(-RE_DATE) %>%
group_by(PATIENT_ID) %>%
summarise(across(everything(), function(x) last(na.omit(x)))) %>%
select(-PATIENT_ID)
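The per-patient reduction can be tried out on a few made-up rows. This is only a sketch: the `toy_tests` table is hypothetical, and it assumes, as the real data appear to, that rows are already ordered by test date within each patient.

```r
library(dplyr)

# Hypothetical blood tests: several rows per patient, with gaps
toy_tests <- tibble(
  PATIENT_ID = c(1, 1, 1, 2, 2),
  albumin    = c(30, NA, 28, 36, NA),
  hs_CRP     = c(NA, 90, NA, 5, 7)
)

# For each patient and each marker, keep the last value that is not NA
toy_last <- toy_tests %>%
  group_by(PATIENT_ID) %>%
  summarise(across(everything(), function(x) last(na.omit(x))))

print(toy_last)
```

Each marker is reduced independently, so the values in one output row can come from different test dates.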
The combined blood samples dataset was also preprocessed for classification. Go to section Classification - dataset cleaning to see how it was cleaned. Columns and patients with too many missing values were deleted from the dataset.
# %>% na_mean(option = "median")
class_df <- last_sample_df
ggplot(patients, aes(x = gender, fill = gender)) +
geom_bar() +
labs(y = "Number of patients",
x = "Gender") +
theme(legend.position = "none")
patients_hist <- ggplot(patients, aes(x = age, fill = gender)) +
# stat = "count" ignores binwidth, so geom_bar() gives the same one-bar-per-age plot
geom_bar() +
labs(y = "Number of patients",
x = "Age") +
scale_x_continuous(breaks=seq(20, 100, 5))
ggplotly(patients_hist)
layout_ggplotly <- function(gg, x = -0.05, y = -0.05){
# Annotations 1 and 2 hold the options for the x and y axis labels respectively;
# a plotly object keeps its layout under the `x` element
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg
}
patients_outcome <- ggplot(patients, aes(x = age, fill = outcome)) +
geom_histogram(binwidth = 1.2) +
facet_grid(~ gender) +
scale_y_continuous(breaks=seq(0, 20, 2)) +
scale_x_continuous(breaks=seq(20, 100, 5)) +
labs(y = "Number of patients", x = "Age")
ggplotly(patients_outcome)
hospitalization_length_plot <- ggplot(patients, aes(x = hospitalization_length, fill = outcome)) +
geom_histogram(binwidth = 1.2) +
facet_grid(outcome ~ gender) +
scale_y_continuous(breaks=seq(0, 20, 2)) +
scale_x_continuous(breaks=seq(0, 40, 5)) +
labs(y = "Number of patients",
x = "Hospitalization length [days]")
ggplotly(hospitalization_length_plot)
outcome_per_day <- patients %>%
mutate(discharge_time = as.Date(discharge_time)) %>%
filter(outcome == "Died")
outcome_per_day_plot <- ggplot(outcome_per_day, aes(x = discharge_time, fill = outcome)) +
geom_histogram(binwidth = 1.2) +
facet_grid(~ gender) +
labs(x = "Discharge date", y = "Number of deaths") +
theme(legend.position = "none")
ggplotly(outcome_per_day_plot)
outcome_during_day_plot <- patients %>%
mutate(time_h_m = hms(format(discharge_time, format = "%H:%M:%S"))) %>%
mutate(time_h_m = (hour(time_h_m) + minute(time_h_m)/60)) %>%
filter(outcome == "Died") %>%
ggplot(aes(x = time_h_m)) +
# A constant fill belongs outside aes(); inside aes() it is treated as a data value
geom_histogram(binwidth = 1.2, fill = "blue") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
labs(x = "Time of the day", y = "Number of deaths")+
theme(legend.position = "none")
ggplotly(outcome_during_day_plot)
Preparing the dataset for correlation (changing factor variables to numeric).
cor_df <- last_sample_df %>%
mutate(outcome = ifelse(outcome == "Died", 1, 0)) %>%
mutate(gender = ifelse(gender == "male", 1, 0)) %>%
rename(male = gender)
correlationMatrix <- correlate(cor_df[sapply(cor_df, is.numeric)], use='pairwise.complete.obs')
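The `use = 'pairwise.complete.obs'` option matters here because different markers are missing for different patients. The sketch below illustrates it with base R `cor()` on hypothetical data; it is a stand-in for illustration, not the `correlate()` call used above.

```r
# Hypothetical toy data with gaps in different rows
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, NA, 10),
  z = c(5, 3, NA, 1, 0)
)

# Each pair of columns uses every row where both values are present,
# so x~y is computed on rows 1-3 while x~z uses rows 1, 2 and 4
toy_cor <- cor(toy, use = "pairwise.complete.obs")

# With the default use = "everything", any NA in a pair gives NA
toy_cor_default <- cor(toy)

print(round(toy_cor, 2))
```

Because every coefficient may be based on a different subset of patients, coefficients for sparsely measured markers rest on fewer observations than the others.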
The previous analysis suggests that elderly people are more likely to die from Covid-19. Below is a short summary of the biomarkers most strongly correlated with age (absolute correlation values).
age_correlation <- correlationMatrix %>%
focus(age) %>%
mutate(age = abs(age)) %>%
arrange(desc(age)) %>%
filter(rowname != "outcome") %>% head
age_correlation %>% kbl() %>% kable_paper("hover")
| rowname | age |
|---|---|
| eGFR | 0.6119405 |
| (%)lymphocyte | 0.5171992 |
| neutrophils(%) | 0.4885978 |
| albumin | 0.4870837 |
| hs_CRP | 0.4299006 |
| neutrophils count | 0.4073328 |
The most strongly correlated marker is eGFR, which measures how effectively the kidneys work. It is hard to give a single normal value, because this marker depends on many factors such as gender, age and body mass, but some sources consider a value above 90 to be normal. In some cases a GFR value that is too low or too high indicates a kidney disease that affects blood filtration.
The chart below presents the eGFR values of patients of different ages, grouped by outcome. It shows that many elderly patients who died had some abnormality in kidney function.
ggplot(last_sample_df, aes(x = age, y = `eGFR`, color = outcome)) +
geom_point() +
theme(legend.position = c(0.9,0.9)) +
ylim(0 , 150)
The next two highly correlated biomarkers are related to the immune system. The values of lymphocytes and neutrophils indicate how strong the organism is and how well it fights the disease.
Lymphocytes are cells responsible for protecting the body from viruses, bacteria and other disease-causing factors, among other things by creating antibodies. The normal value for an adult is between 15 and 40%. Lower lymphocyte levels mean that the body has difficulty fighting the disease. The left chart below suggests that elderly people have a weaker immune system, which makes it harder for their organism to fight the disease.
plot1 <- ggplot(last_sample_df, aes(x = age, y = `(%)lymphocyte`, color = outcome)) + geom_point() + theme(legend.position = "none")
plot2 <- ggplot(last_sample_df, aes(x = `lymphocyte count`, y = `(%)lymphocyte`, color = outcome)) +
geom_point() +
theme(legend.position = c(0.8, 0.2)) +
xlim(0,3.75)
grid.arrange(plot1, plot2, ncol=2)
Neutrophils are an essential part of the immune system: these cells search for pathogens in the organism and destroy them. A high value of neutrophils(%) goes together with a high neutrophil count in the blood (right plot below), which indicates that a medical condition is present in the patient's body and that the immune system is fighting it.
This correlation suggests that elderly people are more vulnerable and that their immune systems need to produce more neutrophils to fight pathogens than those of younger patients. The left plot shows that some of the tested patients had a medical condition, indicated by an increased share of neutrophils. Adding the information about the outcome supports the finding that elderly patients are more likely to die because of Covid-19.
plot1 <- ggplot(last_sample_df, aes(x = age, y = `neutrophils(%)`, color = outcome)) + geom_point() + theme(legend.position = "none")
plot2 <- ggplot(last_sample_df, aes(x = `neutrophils count`, y = `neutrophils(%)`, color = outcome)) + geom_point() + theme(legend.position = c(0.8, 0.2))
grid.arrange(plot1, plot2, ncol=2)
The following section examines the correlation between biomarkers and the outcome.
The correlation matrix for the most highly correlated variables and the numeric (absolute) correlation values are shown below.
'%ni%' <- Negate('%in%')
outcome_cor <- correlationMatrix %>%
focus(outcome) %>%
mutate(outcome = abs(outcome)) %>%
arrange(desc(outcome)) %>%
filter(`rowname` %ni% c('neutrophils(%)', 'neutrophils count')) %>%
mutate(outcome = round(outcome,2)) %>%
filter(outcome > 0.5)
outcome_corr_df <- cor_df %>% select(c(outcome_cor$rowname, outcome))
outcome_cor_matrix <- cor(outcome_corr_df[sapply(outcome_corr_df, is.numeric)], use='pairwise.complete.obs')
corrplot(outcome_cor_matrix)
The previous section already analyzed lymphocytes and how important they are in fighting the disease, so they are not considered further in this section.
outcome_cor %>% kbl() %>% kable_paper("hover")
| rowname | outcome |
|---|---|
| (%)lymphocyte | 0.76 |
| hs_CRP | 0.72 |
| albumin | 0.72 |
| Lactate dehydrogenase | 0.69 |
| Prothrombin activity | 0.68 |
| D-D dimer | 0.68 |
| Fibrin degradation products | 0.66 |
| calcium | 0.64 |
| Platelet count | 0.58 |
| age | 0.56 |
| eosinophils(%) | 0.55 |
| HCO3- | 0.54 |
| thrombocytocrit | 0.53 |
| monocytes(%) | 0.51 |
The tabs below present the values of each biomarker with a correlation above 0.65 for all patients, by age and outcome. The data show that these biomarkers are also correlated with age to some degree, because for these 5 biomarkers the values for elderly patients very often differ from the values for people younger than 50. This is supported by the boxplots below each chart, which present the distribution of the biomarker grouped by age group (adult: younger than 64 years, elderly: 64 years or older), gender and outcome.
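The age grouping used for the boxplots is a single `ifelse()` on the age column. A small sketch on hypothetical ages shows where the boundary falls:

```r
# Hypothetical ages around the cut-off
ages <- c(30, 63, 64, 80)

# Same rule as in the plotting code: strictly below 64 is "adult",
# so a patient aged exactly 64 is counted as "elderly"
age_group <- as.factor(ifelse(ages < 64, "adult", "elderly"))

print(data.frame(age = ages, age_group = age_group))
```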
layout_ggplotly <- function(gg, x = -0.02, y = -0.05){
# Annotations 1 and 2 hold the options for the x and y axis labels respectively
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg
}
ggplot(last_sample_df, aes(x = age, y = `albumin`, color = outcome)) + geom_point()
albumin_plot <- last_sample_df %>%
mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
ggplot(aes(x= age_group, y = `albumin`, fill = gender)) +
geom_boxplot(na.rm=TRUE) + facet_grid(~outcome) +
labs(x = "Age group", y = "Albumin")
ggplotly(albumin_plot) %>% layout(boxmode = "group") %>% layout_ggplotly
ggplot(last_sample_df, aes(x = age, y = `Prothrombin activity`, color = outcome)) + geom_point()
pt_plot <- last_sample_df %>%
mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
ggplot(aes(x= age_group, y = `Prothrombin activity`, fill = gender)) +
geom_boxplot(na.rm=TRUE) + facet_grid(~outcome) +
labs(x = "Age group", y = "Prothrombin activity")
ggplotly(pt_plot) %>% layout(boxmode = "group") %>% layout_ggplotly
This analysis takes a value of about 50 as the reference level for hs-CRP; values above that level indicate some kind of inflammation in the body. Many values on the first plot lie far above it, showing very strong inflammation, which probably contributed to the death.
ggplot(last_sample_df, aes(x = age, y = hs_CRP, color = outcome)) + geom_point()
crp_plot <- last_sample_df %>%
mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
ggplot(aes(x= age_group, y = hs_CRP, fill = gender)) +
geom_boxplot(na.rm=TRUE) + facet_grid(~outcome) +
labs(x = "Age group", y = "High sensitivity C-reactive protein")
ggplotly(crp_plot) %>% layout(boxmode = "group") %>% layout_ggplotly
D-dimers are protein fragments released when a blood clot is broken down. A high value means that clots have been forming and dissolving in the organism. Sometimes it can be linked with myocardial infarction or pulmonary embolism, which combined with Covid-19 symptoms can lead to death.
ggplot(last_sample_df, aes(x = age, y = `D-D dimer`, color = outcome)) + geom_point()
dimer_plot <- last_sample_df %>%
mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
ggplot(aes(x= age_group, y = `D-D dimer`, fill = gender)) +
geom_boxplot(na.rm=TRUE) + facet_grid(~outcome) +
labs(x = "Age group", y = "D-D dimer")
ggplotly(dimer_plot) %>% layout(boxmode = "group") %>% layout_ggplotly
ggplot(last_sample_df, aes(x = age, y = `Lactate dehydrogenase`, color = outcome)) + geom_point()
ldh_plot <- last_sample_df %>%
mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
ggplot(aes(x= age_group, y = `Lactate dehydrogenase`, fill = gender)) +
geom_boxplot(na.rm=TRUE) + facet_grid(~outcome) +
labs(x = "Age group", y = "Lactate dehydrogenase")
ggplotly(ldh_plot) %>% layout(boxmode = "group") %>% layout_ggplotly
The animated cumulative number of deaths over time is presented below. A sharp rise can be noticed between 02.02.2020 and 22.02.2020. After that the deaths levelled off, and another peak occurred on 04.04.2020.
patients_agg <- patients %>% select(c(discharge_time, outcome)) %>%
mutate(discharge_time = as.Date(discharge_time)) %>%
filter(outcome == 'Died') %>%
group_by(discharge_time) %>%
summarise(deaths_count = n(), .groups="drop") %>%
arrange(discharge_time) %>%
mutate(deaths_count_agg = cumsum(deaths_count))
ggplot(patients_agg, aes(x = discharge_time, y = deaths_count_agg)) +
geom_line(size = 1.1, color = 'red') +
transition_reveal(discharge_time) +
labs(x = "Discharge time", y = "Deaths count aggregate") +
scale_x_date(breaks = seq(min(patients_agg$discharge_time), max(patients_agg$discharge_time), by = 10))
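The aggregation behind the animated line can be checked on a handful of made-up discharges. The sketch below uses a hypothetical `toy_patients` table and assumes the timestamps are in UTC:

```r
library(dplyr)

# Hypothetical discharges; the row order is deliberately not chronological
toy_patients <- tibble(
  discharge_time = as.POSIXct(c("2020-02-02 10:00:00", "2020-02-02 18:30:00",
                                "2020-02-03 09:15:00", "2020-02-05 23:59:00",
                                "2020-02-04 12:00:00"), tz = "UTC"),
  outcome = c("Died", "Died", "Survived", "Died", "Died")
)

toy_agg <- toy_patients %>%
  mutate(discharge_time = as.Date(discharge_time)) %>%  # drop the time of day
  filter(outcome == "Died") %>%
  group_by(discharge_time) %>%
  summarise(deaths_count = n(), .groups = "drop") %>%   # deaths per day
  arrange(discharge_time) %>%
  mutate(deaths_count_agg = cumsum(deaths_count))       # running total

print(toy_agg)
```

Days without deaths produce no row, so the line connects consecutive days on which at least one death occurred.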
In this chapter a classification model is trained to predict the outcome (death/survival) of COVID-19 patients based on basic patient observations and blood test samples. One blood test for each patient is considered as an observation for the machine learning algorithm. As explained earlier, extra data preprocessing was needed to prepare the dataset. For each patient, all the blood tests are reduced to one row, containing the value closest to the discharge time.
Redundant columns like patient id, blood test time and admission and discharge time were removed from the dataset.
For the machine learning process there should be no missing values in the dataset. The summary below shows that there are columns with many missing values.
class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
| Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | Alkaline phosphatase | albumin | basophil(%) | Interleukin 10 | Total bilirubin | Platelet count | monocytes(%) | antithrombin | Interleukin 8 | indirect bilirubin | Red blood cell distribution width | neutrophils(%) | total protein | Quantification of Treponema pallidum antibodies | Prothrombin activity | HBsAg | mean corpuscular volume | hematocrit | White blood cell count | Tumor necrosis factor alpha | mean corpuscular hemoglobin concentration | fibrinogen | Interleukin 1 beta | Urea | lymphocyte count | PH value | Red blood cell count | Eosinophil count | Corrected calcium | Serum potassium | glucose | neutrophils count | Direct bilirubin | Mean platelet volume | ferritin | RBC distribution width SD | Thrombin time | (%)lymphocyte | HCV antibody quantification | D-D dimer | Total cholesterol | aspartate aminotransferase | Uric acid | HCO3- | calcium | Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | Lactate dehydrogenase | platelet large cell ratio | Interleukin 6 | Fibrin degradation products | monocytes count | PLT distribution width | globulin | Gamma glutamyl transpeptidase | International standard ratio | basophil count(#) | 2019-nCoV nucleic acid detection | mean corpuscular hemoglobin | Activation of partial thromboplastin time | hs_CRP | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.90 | Min. : 6.4 | Min. : 77.70 | Min. :11.50 | Min. : 0.020 | Min. :0.000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.0000 | Min. : 5.00 | Min. : 2.80 | Min. : -1 | Min. : 0.600 | Min. : 20.00 | Min. : 5.00 | Min. : 0.100 | Min. :10.60 | Min. : 1.90 | Min. :31.80 | Min. : 0.0200 | Min. : 7.00 | Min. : 0.000 | Min. : 62.30 | Min. :15.60 | Min. : 0.71 | Min. : 4.000 | Min. :286.0 | Min. :0.500 | Min. : 5.000 | Min. : 1.70 | Min. : 0.050 | Min. :5.000 | Min. : 0.100 | Min. :0.00000 | Min. :1.650 | Min. :2.760 | Min. : 1.000 | Min. : 0.320 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.300 | Min. :0.0200 | Min. : 0.2100 | Min. :0.100 | Min. : 6.00 | Min. : 52.0 | Min. : 6.30 | Min. :1.170 | Min. : 5.0 | Min. : 110.0 | Min. :11.20 | Min. : 1.500 | Min. : 4.00 | Min. : 0.010 | Min. : 8.10 | Min. :10.10 | Min. : 7.00 | Min. : 0.840 | Min. :0.0000 | Min. :-1 | Min. :20.80 | Min. : 21.80 | Min. : 0.10 | Min. :0.05000 | Min. :121.1 | Min. :0.0100 | Min. : 1.0 | Min. : 5.00 | Min. : 2.00 | Min. : 14.0 | |
| 1st Qu.: 2.45 | 1st Qu.:112.0 | 1st Qu.: 99.53 | 1st Qu.:13.40 | 1st Qu.: 0.030 | 1st Qu.:0.000 | 1st Qu.: 457.8 | 1st Qu.: 54.00 | 1st Qu.:28.20 | 1st Qu.:0.1000 | 1st Qu.: 5.00 | 1st Qu.: 7.20 | 1st Qu.:113 | 1st Qu.: 2.975 | 1st Qu.: 76.25 | 1st Qu.: 8.10 | 1st Qu.: 3.700 | 1st Qu.:12.03 | 1st Qu.:61.73 | 1st Qu.:61.20 | 1st Qu.: 0.0400 | 1st Qu.: 67.00 | 1st Qu.: 0.000 | 1st Qu.: 86.90 | 1st Qu.:33.00 | 1st Qu.: 5.12 | 1st Qu.: 6.675 | 1st Qu.:332.0 | 1st Qu.:3.183 | 1st Qu.: 5.000 | 1st Qu.: 3.80 | 1st Qu.: 0.520 | 1st Qu.:6.000 | 1st Qu.: 3.550 | 1st Qu.:0.00000 | 1st Qu.:2.260 | 1st Qu.:4.032 | 1st Qu.: 5.120 | 1st Qu.: 3.100 | 1st Qu.: 3.100 | 1st Qu.:10.10 | 1st Qu.: 402.0 | 1st Qu.: 38.80 | 1st Qu.: 15.60 | 1st Qu.: 4.175 | 1st Qu.:0.0400 | 1st Qu.: 0.4925 | 1st Qu.:2.950 | 1st Qu.: 19.00 | 1st Qu.: 198.8 | 1st Qu.:20.90 | 1st Qu.:1.990 | 1st Qu.: 58.5 | 1st Qu.: 199.0 | 1st Qu.:25.32 | 1st Qu.: 3.955 | 1st Qu.: 4.00 | 1st Qu.: 0.310 | 1st Qu.:10.93 | 1st Qu.:28.98 | 1st Qu.: 21.00 | 1st Qu.: 1.018 | 1st Qu.:0.0100 | 1st Qu.:-1 | 1st Qu.:29.70 | 1st Qu.: 35.15 | 1st Qu.: 2.00 | 1st Qu.:0.07000 | 1st Qu.:138.3 | 1st Qu.:0.1400 | 1st Qu.: 13.0 | 1st Qu.: 17.00 | 1st Qu.: 66.70 | 1st Qu.: 58.0 | |
| Median : 12.30 | Median :125.0 | Median :102.30 | Median :14.30 | Median : 0.100 | Median :0.250 | Median : 663.5 | Median : 71.00 | Median :33.20 | Median :0.2000 | Median : 5.20 | Median : 10.60 | Median :192 | Median : 6.250 | Median : 87.00 | Median : 14.75 | Median : 5.300 | Median :12.75 | Median :77.55 | Median :66.00 | Median : 0.0500 | Median : 86.50 | Median : 0.010 | Median : 90.40 | Median :36.30 | Median : 7.93 | Median : 8.300 | Median :342.0 | Median :4.220 | Median : 5.000 | Median : 5.40 | Median : 0.990 | Median :6.000 | Median : 4.100 | Median :0.02000 | Median :2.370 | Median :4.430 | Median : 6.540 | Median : 5.380 | Median : 4.800 | Median :10.80 | Median : 759.7 | Median : 41.20 | Median : 16.55 | Median :14.350 | Median :0.0600 | Median : 1.3300 | Median :3.720 | Median : 25.00 | Median : 260.0 | Median :23.90 | Median :2.110 | Median : 304.0 | Median : 273.5 | Median :30.85 | Median : 18.010 | Median : 5.80 | Median : 0.430 | Median :12.50 | Median :32.40 | Median : 33.00 | Median : 1.095 | Median :0.0200 | Median :-1 | Median :30.90 | Median : 38.90 | Median : 26.30 | Median :0.09000 | Median :140.7 | Median :0.2100 | Median : 28.0 | Median : 26.00 | Median : 89.35 | Median : 74.0 | |
| Mean : 795.91 | Mean :124.3 | Mean :103.30 | Mean :16.04 | Mean : 1.095 | Mean :0.902 | Mean : 934.6 | Mean : 85.62 | Mean :32.67 | Mean :0.2646 | Mean : 12.89 | Mean : 16.50 | Mean :193 | Mean : 6.525 | Mean : 86.36 | Mean : 95.37 | Mean : 6.757 | Mean :13.22 | Mean :75.39 | Mean :65.28 | Mean : 0.1332 | Mean : 81.25 | Mean : 8.427 | Mean : 90.61 | Mean :36.58 | Mean : 18.93 | Mean : 11.929 | Mean :342.1 | Mean :4.305 | Mean : 6.716 | Mean : 9.88 | Mean : 1.166 | Mean :6.348 | Mean : 8.449 | Mean :0.05379 | Mean :2.347 | Mean :4.500 | Mean : 8.525 | Mean : 8.001 | Mean : 9.767 | Mean :10.98 | Mean : 1519.3 | Mean : 42.83 | Mean : 17.72 | Mean :16.913 | Mean :0.1119 | Mean : 6.2456 | Mean :3.748 | Mean : 54.22 | Mean : 296.1 | Mean :23.20 | Mean :2.096 | Mean : 3772.4 | Mean : 476.5 | Mean :32.22 | Mean : 127.050 | Mean : 46.73 | Mean : 0.596 | Mean :13.23 | Mean :32.58 | Mean : 49.44 | Mean : 1.298 | Mean :0.0214 | Mean :-1 | Mean :31.01 | Mean : 41.27 | Mean : 64.86 | Mean :0.09931 | Mean :141.8 | Mean :0.2131 | Mean : 33.6 | Mean : 42.66 | Mean : 81.74 | Mean : 119.7 | |
| 3rd Qu.: 79.85 | 3rd Qu.:138.0 | 3rd Qu.:105.58 | 3rd Qu.:16.30 | 3rd Qu.: 0.450 | 3rd Qu.:1.500 | 3rd Qu.:1172.5 | 3rd Qu.: 98.00 | 3rd Qu.:37.62 | 3rd Qu.:0.4000 | 3rd Qu.: 11.90 | 3rd Qu.: 16.12 | 3rd Qu.:257 | 3rd Qu.: 8.900 | 3rd Qu.: 98.00 | 3rd Qu.: 34.42 | 3rd Qu.: 7.900 | 3rd Qu.:13.80 | 3rd Qu.:91.92 | 3rd Qu.:70.42 | 3rd Qu.: 0.0700 | 3rd Qu.: 98.00 | 3rd Qu.: 0.010 | 3rd Qu.: 94.22 | 3rd Qu.:40.12 | 3rd Qu.: 13.20 | 3rd Qu.: 11.600 | 3rd Qu.:349.0 | 3rd Qu.:5.410 | 3rd Qu.: 5.000 | 3rd Qu.:11.53 | 3rd Qu.: 1.540 | 3rd Qu.:7.000 | 3rd Qu.: 4.650 | 3rd Qu.:0.09000 | 3rd Qu.:2.450 | 3rd Qu.:4.817 | 3rd Qu.: 9.915 | 3rd Qu.:11.242 | 3rd Qu.: 7.425 | 3rd Qu.:11.60 | 3rd Qu.: 1436.6 | 3rd Qu.: 45.27 | 3rd Qu.: 17.90 | 3rd Qu.:27.525 | 3rd Qu.:0.0900 | 3rd Qu.:12.0175 | 3rd Qu.:4.380 | 3rd Qu.: 41.00 | 3rd Qu.: 349.1 | 3rd Qu.:26.32 | 3rd Qu.:2.220 | 3rd Qu.: 1921.0 | 3rd Qu.: 617.8 | 3rd Qu.:37.75 | 3rd Qu.: 61.123 | 3rd Qu.:101.78 | 3rd Qu.: 0.610 | 3rd Qu.:14.50 | 3rd Qu.:35.80 | 3rd Qu.: 55.00 | 3rd Qu.: 1.302 | 3rd Qu.:0.0300 | 3rd Qu.:-1 | 3rd Qu.:32.20 | 3rd Qu.: 44.20 | 3rd Qu.: 99.10 | 3rd Qu.:0.11000 | 3rd Qu.:143.3 | 3rd Qu.:0.2775 | 3rd Qu.: 47.0 | 3rd Qu.: 42.00 | 3rd Qu.:105.00 | 3rd Qu.: 97.0 | |
| Max. :50000.00 | Max. :178.0 | Max. :140.40 | Max. :92.10 | Max. :57.170 | Max. :8.600 | Max. :7500.0 | Max. :620.00 | Max. :47.60 | Max. :1.7000 | Max. :500.00 | Max. :295.40 | Max. :554 | Max. :53.000 | Max. :136.00 | Max. :6795.00 | Max. :59.700 | Max. :27.10 | Max. :98.90 | Max. :83.40 | Max. :11.9500 | Max. :142.00 | Max. :250.000 | Max. :117.60 | Max. :52.30 | Max. :1726.60 | Max. :168.000 | Max. :488.0 | Max. :8.950 | Max. :88.500 | Max. :68.40 | Max. :33.690 | Max. :7.565 | Max. :749.500 | Max. :0.46000 | Max. :2.790 | Max. :9.860 | Max. :38.820 | Max. :32.220 | Max. :242.900 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :144.90 | Max. :48.500 | Max. :2.0900 | Max. :21.0000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :33.80 | Max. :2.600 | Max. :70000.0 | Max. :1867.0 | Max. :62.20 | Max. :5000.000 | Max. :190.80 | Max. :39.920 | Max. :25.30 | Max. :49.20 | Max. :732.00 | Max. :11.570 | Max. :0.1200 | Max. :-1 | Max. :50.80 | Max. :106.40 | Max. :320.00 | Max. :0.27000 | Max. :179.5 | Max. :0.5100 | Max. :110.0 | Max. :1508.00 | Max. :206.90 | Max. :1497.0 | |
| NA’s :74 | NA’s :5 | NA’s :7 | NA’s :9 | NA’s :48 | NA’s :5 | NA’s :145 | NA’s :5 | NA’s :5 | NA’s :5 | NA’s :146 | NA’s :5 | NA’s :5 | NA’s :5 | NA’s :159 | NA’s :145 | NA’s :6 | NA’s :11 | NA’s :5 | NA’s :5 | NA’s :86 | NA’s :9 | NA’s :86 | NA’s :5 | NA’s :5 | NA’s :4 | NA’s :145 | NA’s :5 | NA’s :63 | NA’s :145 | NA’s :5 | NA’s :5 | NA’s :129 | NA’s :4 | NA’s :5 | NA’s :8 | NA’s :7 | NA’s :10 | NA’s :5 | NA’s :5 | NA’s :15 | NA’s :148 | NA’s :11 | NA’s :63 | NA’s :5 | NA’s :86 | NA’s :19 | NA’s :5 | NA’s :5 | NA’s :5 | NA’s :5 | NA’s :7 | NA’s :94 | NA’s :5 | NA’s :15 | NA’s :143 | NA’s :159 | NA’s :5 | NA’s :15 | NA’s :5 | NA’s :5 | NA’s :9 | NA’s :5 | NA’s :143 | NA’s :5 | NA’s :63 | NA’s :8 | NA’s :87 | NA’s :7 | NA’s :15 | NA’s :73 | NA’s :5 | NA’s :5 | NA’s :5 |
The cleaning below checks whether the dataset contains patients who have basic info like age and gender but too many missing biomarker values (35 or more); these patients are removed from the dataset.
# Delete rows (patients) with too many missing values (35 or more)
row_na_sum <- rowSums(is.na(class_df))
rows_to_delete <- which(row_na_sum >= 35)
patients_to_delete <- length(rows_to_delete)
# A logical mask stays correct even when no row reaches the threshold
class_df <- class_df[row_na_sum < 35, ]
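When rows are dropped by their missing-value count, negative indexing and a logical mask behave differently if no row reaches the threshold. The sketch below shows the difference on a hypothetical `toy` data frame; the thresholds are chosen only for illustration.

```r
# Hypothetical data: rows have 1, 2 and 0 missing values
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))
row_na_sum <- rowSums(is.na(toy))

# Threshold that no row reaches
none <- which(row_na_sum >= 3)          # empty index vector
dropped_by_index <- toy[-none, ]        # -integer(0) selects nothing: all rows are lost
kept_by_mask <- toy[row_na_sum < 3, ]   # all rows are kept, as intended

# Threshold that removes the second row only
kept_2 <- toy[row_na_sum < 2, ]

print(nrow(dropped_by_index))
print(nrow(kept_by_mask))
print(nrow(kept_2))
```

`rowSums(is.na(...))` also replaces the row-by-row loop with a single vectorized call.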
class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
| Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | Alkaline phosphatase | albumin | basophil(%) | Interleukin 10 | Total bilirubin | Platelet count | monocytes(%) | antithrombin | Interleukin 8 | indirect bilirubin | Red blood cell distribution width | neutrophils(%) | total protein | Quantification of Treponema pallidum antibodies | Prothrombin activity | HBsAg | mean corpuscular volume | hematocrit | White blood cell count | Tumor necrosis factor alpha | mean corpuscular hemoglobin concentration | fibrinogen | Interleukin 1 beta | Urea | lymphocyte count | PH value | Red blood cell count | Eosinophil count | Corrected calcium | Serum potassium | glucose | neutrophils count | Direct bilirubin | Mean platelet volume | ferritin | RBC distribution width SD | Thrombin time | (%)lymphocyte | HCV antibody quantification | D-D dimer | Total cholesterol | aspartate aminotransferase | Uric acid | HCO3- | calcium | Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | Lactate dehydrogenase | platelet large cell ratio | Interleukin 6 | Fibrin degradation products | monocytes count | PLT distribution width | globulin | Gamma glutamyl transpeptidase | International standard ratio | basophil count(#) | 2019-nCoV nucleic acid detection | mean corpuscular hemoglobin | Activation of partial thromboplastin time | hs_CRP | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.90 | Min. : 6.4 | Min. : 77.7 | Min. :11.50 | Min. : 0.020 | Min. :0.0000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.0000 | Min. : 5.00 | Min. : 2.80 | Min. : -1.0 | Min. : 0.600 | Min. : 20.00 | Min. : 5.00 | Min. : 0.100 | Min. :10.60 | Min. : 1.90 | Min. :31.80 | Min. : 0.0200 | Min. : 7.00 | Min. : 0.000 | Min. : 62.30 | Min. :15.60 | Min. : 0.710 | Min. : 4.000 | Min. :286.0 | Min. :0.50 | Min. : 5.000 | Min. : 1.70 | Min. : 0.050 | Min. :5.000 | Min. : 0.100 | Min. :0.00000 | Min. :1.650 | Min. :2.760 | Min. : 1.000 | Min. : 0.320 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.30 | Min. :0.0200 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 52.0 | Min. : 6.30 | Min. :1.170 | Min. : 5 | Min. : 110.0 | Min. :11.20 | Min. : 1.500 | Min. : 4.00 | Min. : 0.0100 | Min. : 8.10 | Min. :10.10 | Min. : 7.00 | Min. : 0.840 | Min. :0.00000 | Min. :-1 | Min. :20.80 | Min. : 21.80 | Min. : 0.10 | Min. :0.05000 | Min. :121.1 | Min. :0.010 | Min. : 1.00 | Min. : 5.00 | Min. : 2.00 | Min. : 14.0 | |
| 1st Qu.: 2.45 | 1st Qu.:112.0 | 1st Qu.: 99.6 | 1st Qu.:13.40 | 1st Qu.: 0.030 | 1st Qu.:0.0000 | 1st Qu.: 457.8 | 1st Qu.: 54.00 | 1st Qu.:28.20 | 1st Qu.:0.1000 | 1st Qu.: 5.00 | 1st Qu.: 7.20 | 1st Qu.:113.0 | 1st Qu.: 2.950 | 1st Qu.: 76.00 | 1st Qu.: 8.10 | 1st Qu.: 3.725 | 1st Qu.:12.00 | 1st Qu.:61.85 | 1st Qu.:61.20 | 1st Qu.: 0.0400 | 1st Qu.: 67.00 | 1st Qu.: 0.000 | 1st Qu.: 86.95 | 1st Qu.:33.00 | 1st Qu.: 5.115 | 1st Qu.: 6.675 | 1st Qu.:332.0 | 1st Qu.:3.18 | 1st Qu.: 5.000 | 1st Qu.: 3.80 | 1st Qu.: 0.520 | 1st Qu.:6.000 | 1st Qu.: 3.550 | 1st Qu.:0.00000 | 1st Qu.:2.260 | 1st Qu.:4.030 | 1st Qu.: 5.120 | 1st Qu.: 3.110 | 1st Qu.: 3.100 | 1st Qu.:10.10 | 1st Qu.: 402.0 | 1st Qu.: 38.80 | 1st Qu.: 15.60 | 1st Qu.: 4.15 | 1st Qu.:0.0400 | 1st Qu.: 0.490 | 1st Qu.:2.950 | 1st Qu.: 19.00 | 1st Qu.: 198.5 | 1st Qu.:20.90 | 1st Qu.:1.990 | 1st Qu.: 57 | 1st Qu.: 198.0 | 1st Qu.:25.30 | 1st Qu.: 3.955 | 1st Qu.: 4.00 | 1st Qu.: 0.3100 | 1st Qu.:11.00 | 1st Qu.:28.95 | 1st Qu.: 21.00 | 1st Qu.: 1.015 | 1st Qu.:0.01000 | 1st Qu.:-1 | 1st Qu.:29.70 | 1st Qu.: 35.10 | 1st Qu.: 2.00 | 1st Qu.:0.07000 | 1st Qu.:138.3 | 1st Qu.:0.140 | 1st Qu.: 13.50 | 1st Qu.: 17.00 | 1st Qu.: 66.75 | 1st Qu.: 58.0 | |
| Median : 12.30 | Median :125.0 | Median :102.3 | Median :14.30 | Median : 0.095 | Median :0.2000 | Median : 663.5 | Median : 71.00 | Median :33.20 | Median :0.2000 | Median : 5.20 | Median : 10.60 | Median :190.0 | Median : 6.200 | Median : 87.00 | Median : 14.75 | Median : 5.300 | Median :12.70 | Median :77.80 | Median :66.00 | Median : 0.0500 | Median : 86.00 | Median : 0.010 | Median : 90.40 | Median :36.30 | Median : 7.930 | Median : 8.300 | Median :342.0 | Median :4.22 | Median : 5.000 | Median : 5.40 | Median : 0.990 | Median :6.000 | Median : 4.100 | Median :0.02000 | Median :2.370 | Median :4.430 | Median : 6.540 | Median : 5.390 | Median : 4.800 | Median :10.80 | Median : 759.7 | Median : 41.20 | Median : 16.50 | Median :14.20 | Median :0.0600 | Median : 1.330 | Median :3.720 | Median : 25.00 | Median : 260.0 | Median :23.90 | Median :2.110 | Median : 290 | Median : 274.0 | Median :30.90 | Median : 18.010 | Median : 5.80 | Median : 0.4300 | Median :12.50 | Median :32.40 | Median : 33.00 | Median : 1.100 | Median :0.02000 | Median :-1 | Median :30.90 | Median : 38.90 | Median : 26.50 | Median :0.09000 | Median :140.7 | Median :0.210 | Median : 28.00 | Median : 26.00 | Median : 89.40 | Median : 74.0 | |
| Mean : 795.91 | Mean :124.4 | Mean :103.3 | Mean :16.05 | Mean : 1.098 | Mean :0.8994 | Mean : 934.6 | Mean : 85.68 | Mean :32.65 | Mean :0.2631 | Mean : 12.89 | Mean : 16.54 | Mean :192.9 | Mean : 6.519 | Mean : 86.32 | Mean : 95.37 | Mean : 6.771 | Mean :13.21 | Mean :75.47 | Mean :65.25 | Mean : 0.1335 | Mean : 81.22 | Mean : 8.489 | Mean : 90.62 | Mean :36.58 | Mean : 18.974 | Mean : 11.929 | Mean :342.1 | Mean :4.30 | Mean : 6.716 | Mean : 9.85 | Mean : 1.163 | Mean :6.347 | Mean : 8.326 | Mean :0.05369 | Mean :2.347 | Mean :4.501 | Mean : 8.525 | Mean : 8.017 | Mean : 9.788 | Mean :10.98 | Mean : 1519.3 | Mean : 42.83 | Mean : 17.72 | Mean :16.84 | Mean :0.1116 | Mean : 6.262 | Mean :3.745 | Mean : 54.35 | Mean : 295.4 | Mean :23.20 | Mean :2.095 | Mean : 3757 | Mean : 477.3 | Mean :32.24 | Mean : 127.050 | Mean : 46.95 | Mean : 0.5964 | Mean :13.24 | Mean :32.56 | Mean : 49.53 | Mean : 1.299 | Mean :0.02135 | Mean :-1 | Mean :31.01 | Mean : 41.27 | Mean : 64.98 | Mean :0.09912 | Mean :141.8 | Mean :0.213 | Mean : 33.68 | Mean : 42.76 | Mean : 81.95 | Mean : 117.9 | |
| 3rd Qu.: 79.85 | 3rd Qu.:138.0 | 3rd Qu.:105.6 | 3rd Qu.:16.30 | 3rd Qu.: 0.450 | 3rd Qu.:1.5000 | 3rd Qu.:1172.5 | 3rd Qu.: 98.00 | 3rd Qu.:37.60 | 3rd Qu.:0.4000 | 3rd Qu.: 11.90 | 3rd Qu.: 16.15 | 3rd Qu.:257.0 | 3rd Qu.: 8.900 | 3rd Qu.: 98.00 | 3rd Qu.: 34.42 | 3rd Qu.: 7.900 | 3rd Qu.:13.80 | 3rd Qu.:91.95 | 3rd Qu.:70.40 | 3rd Qu.: 0.0700 | 3rd Qu.: 98.00 | 3rd Qu.: 0.010 | 3rd Qu.: 94.25 | 3rd Qu.:40.15 | 3rd Qu.: 13.170 | 3rd Qu.: 11.600 | 3rd Qu.:349.0 | 3rd Qu.:5.41 | 3rd Qu.: 5.000 | 3rd Qu.:11.50 | 3rd Qu.: 1.540 | 3rd Qu.:7.000 | 3rd Qu.: 4.640 | 3rd Qu.:0.09000 | 3rd Qu.:2.450 | 3rd Qu.:4.820 | 3rd Qu.: 9.915 | 3rd Qu.:11.275 | 3rd Qu.: 7.450 | 3rd Qu.:11.60 | 3rd Qu.: 1436.6 | 3rd Qu.: 45.30 | 3rd Qu.: 17.90 | 3rd Qu.:27.50 | 3rd Qu.:0.0900 | 3rd Qu.:12.050 | 3rd Qu.:4.370 | 3rd Qu.: 41.00 | 3rd Qu.: 347.9 | 3rd Qu.:26.35 | 3rd Qu.:2.220 | 3rd Qu.: 1894 | 3rd Qu.: 618.5 | 3rd Qu.:37.80 | 3rd Qu.: 61.123 | 3rd Qu.:104.10 | 3rd Qu.: 0.6100 | 3rd Qu.:14.50 | 3rd Qu.:35.75 | 3rd Qu.: 55.00 | 3rd Qu.: 1.305 | 3rd Qu.:0.03000 | 3rd Qu.:-1 | 3rd Qu.:32.20 | 3rd Qu.: 44.20 | 3rd Qu.: 99.12 | 3rd Qu.:0.11000 | 3rd Qu.:143.3 | 3rd Qu.:0.280 | 3rd Qu.: 47.00 | 3rd Qu.: 42.00 | 3rd Qu.:105.00 | 3rd Qu.: 97.0 | |
| Max. :50000.00 | Max. :178.0 | Max. :140.4 | Max. :92.10 | Max. :57.170 | Max. :8.6000 | Max. :7500.0 | Max. :620.00 | Max. :47.60 | Max. :1.7000 | Max. :500.00 | Max. :295.40 | Max. :554.0 | Max. :53.000 | Max. :136.00 | Max. :6795.00 | Max. :59.700 | Max. :27.10 | Max. :98.90 | Max. :83.40 | Max. :11.9500 | Max. :142.00 | Max. :250.000 | Max. :117.60 | Max. :52.30 | Max. :1726.600 | Max. :168.000 | Max. :488.0 | Max. :8.95 | Max. :88.500 | Max. :68.40 | Max. :33.690 | Max. :7.565 | Max. :749.500 | Max. :0.46000 | Max. :2.790 | Max. :9.860 | Max. :38.820 | Max. :32.220 | Max. :242.900 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :144.90 | Max. :48.50 | Max. :2.0900 | Max. :21.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :33.80 | Max. :2.600 | Max. :70000 | Max. :1867.0 | Max. :62.20 | Max. :5000.000 | Max. :190.80 | Max. :39.9200 | Max. :25.30 | Max. :49.20 | Max. :732.00 | Max. :11.570 | Max. :0.12000 | Max. :-1 | Max. :50.80 | Max. :106.40 | Max. :320.00 | Max. :0.27000 | Max. :179.5 | Max. :0.510 | Max. :110.00 | Max. :1508.00 | Max. :206.90 | Max. :1497.0 | |
| NA’s :69 | NA’s :1 | NA’s :3 | NA’s :5 | NA’s :44 | NA’s :1 | NA’s :140 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :141 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :155 | NA’s :140 | NA’s :2 | NA’s :7 | NA’s :1 | NA’s :1 | NA’s :83 | NA’s :5 | NA’s :83 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :140 | NA’s :1 | NA’s :59 | NA’s :140 | NA’s :1 | NA’s :1 | NA’s :127 | NA’s :1 | NA’s :1 | NA’s :4 | NA’s :3 | NA’s :5 | NA’s :1 | NA’s :1 | NA’s :11 | NA’s :143 | NA’s :7 | NA’s :59 | NA’s :1 | NA’s :83 | NA’s :15 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :3 | NA’s :91 | NA’s :1 | NA’s :11 | NA’s :138 | NA’s :155 | NA’s :1 | NA’s :11 | NA’s :1 | NA’s :1 | NA’s :5 | NA’s :1 | NA’s :140 | NA’s :1 | NA’s :59 | NA’s :4 | NA’s :84 | NA’s :3 | NA’s :11 | NA’s :69 | NA’s :1 | NA’s :1 | NA’s :1 |
5 patients are removed from the dataset because each of them has 35 or more missing values.
Eighteen columns have 69 or more missing values (see the NA's row above). They won't be used for the classification.
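The row filter can also be written without a loop. The following is a minimal base R sketch on a made-up data frame (`toy_df` and `na_threshold` are illustrative names, not part of the analysis). It counts missing values per row with `rowSums` and keeps the rows below the threshold.

```r
# Toy data frame standing in for class_df (values are made up).
toy_df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 6, 7),
  c = c(1, NA, 3, 4)
)

# Hypothetical threshold for this toy example; the analysis uses 35.
na_threshold <- 2

# Count missing values per row in one vectorized call.
na_per_row <- rowSums(is.na(toy_df))

# Logical indexing stays correct even when no row reaches the threshold.
toy_clean <- toy_df[na_per_row < na_threshold, , drop = FALSE]
removed_rows <- sum(na_per_row >= na_threshold)
```

Logical indexing avoids the edge case of negative indexing with an empty vector, which would return a data frame with no rows.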
class_df <- class_df %>% select(-c(`Interleukin 2 receptor`, `Interleukin 10`, `antithrombin`, `Interleukin 8`, `Quantification of Treponema pallidum antibodies`, `HBsAg`, `Tumor necrosis factor alpha`, `Interleukin 1 beta`, `PH value`, `ferritin`, `Amino-terminal brain natriuretic peptide precursor(NT-proBNP)`, `Interleukin 6` , `Fibrin degradation products`, `2019-nCoV nucleic acid detection`, `HIV antibody quantification`, `Hypersensitive cardiac troponinI`, `HCV antibody quantification`, `ESR`))
class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
| hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Alkaline phosphatase | albumin | basophil(%) | Total bilirubin | Platelet count | monocytes(%) | indirect bilirubin | Red blood cell distribution width | neutrophils(%) | total protein | Prothrombin activity | mean corpuscular volume | hematocrit | White blood cell count | mean corpuscular hemoglobin concentration | fibrinogen | Urea | lymphocyte count | Red blood cell count | Eosinophil count | Corrected calcium | Serum potassium | glucose | neutrophils count | Direct bilirubin | Mean platelet volume | RBC distribution width SD | Thrombin time | (%)lymphocyte | D-D dimer | Total cholesterol | aspartate aminotransferase | Uric acid | HCO3- | calcium | Lactate dehydrogenase | platelet large cell ratio | monocytes count | PLT distribution width | globulin | Gamma glutamyl transpeptidase | International standard ratio | basophil count(#) | mean corpuscular hemoglobin | Activation of partial thromboplastin time | hs_CRP | serum sodium | thrombocytocrit | glutamic-pyruvic transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 6.4 | Min. : 77.7 | Min. :11.50 | Min. : 0.020 | Min. :0.0000 | Min. : 17.00 | Min. :13.60 | Min. :0.0000 | Min. : 2.80 | Min. : -1.0 | Min. : 0.600 | Min. : 0.100 | Min. :10.60 | Min. : 1.90 | Min. :31.80 | Min. : 7.00 | Min. : 62.30 | Min. :15.60 | Min. : 0.710 | Min. :286.0 | Min. :0.50 | Min. : 1.70 | Min. : 0.050 | Min. : 0.100 | Min. :0.00000 | Min. :1.650 | Min. :2.760 | Min. : 1.000 | Min. : 0.320 | Min. : 1.600 | Min. : 8.50 | Min. : 31.30 | Min. : 13.00 | Min. : 0.30 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 52.0 | Min. : 6.30 | Min. :1.170 | Min. : 110.0 | Min. :11.20 | Min. : 0.0100 | Min. : 8.10 | Min. :10.10 | Min. : 7.00 | Min. : 0.840 | Min. :0.00000 | Min. :20.80 | Min. : 21.80 | Min. : 0.10 | Min. :121.1 | Min. :0.010 | Min. : 5.00 | Min. : 2.00 | Min. : 14.0 | |
| 1st Qu.:112.0 | 1st Qu.: 99.6 | 1st Qu.:13.40 | 1st Qu.: 0.030 | 1st Qu.:0.0000 | 1st Qu.: 54.00 | 1st Qu.:28.20 | 1st Qu.:0.1000 | 1st Qu.: 7.20 | 1st Qu.:113.0 | 1st Qu.: 2.950 | 1st Qu.: 3.725 | 1st Qu.:12.00 | 1st Qu.:61.85 | 1st Qu.:61.20 | 1st Qu.: 67.00 | 1st Qu.: 86.95 | 1st Qu.:33.00 | 1st Qu.: 5.115 | 1st Qu.:332.0 | 1st Qu.:3.18 | 1st Qu.: 3.80 | 1st Qu.: 0.520 | 1st Qu.: 3.550 | 1st Qu.:0.00000 | 1st Qu.:2.260 | 1st Qu.:4.030 | 1st Qu.: 5.120 | 1st Qu.: 3.110 | 1st Qu.: 3.100 | 1st Qu.:10.10 | 1st Qu.: 38.80 | 1st Qu.: 15.60 | 1st Qu.: 4.15 | 1st Qu.: 0.490 | 1st Qu.:2.950 | 1st Qu.: 19.00 | 1st Qu.: 198.5 | 1st Qu.:20.90 | 1st Qu.:1.990 | 1st Qu.: 198.0 | 1st Qu.:25.30 | 1st Qu.: 0.3100 | 1st Qu.:11.00 | 1st Qu.:28.95 | 1st Qu.: 21.00 | 1st Qu.: 1.015 | 1st Qu.:0.01000 | 1st Qu.:29.70 | 1st Qu.: 35.10 | 1st Qu.: 2.00 | 1st Qu.:138.3 | 1st Qu.:0.140 | 1st Qu.: 17.00 | 1st Qu.: 66.75 | 1st Qu.: 58.0 | |
| Median :125.0 | Median :102.3 | Median :14.30 | Median : 0.095 | Median :0.2000 | Median : 71.00 | Median :33.20 | Median :0.2000 | Median : 10.60 | Median :190.0 | Median : 6.200 | Median : 5.300 | Median :12.70 | Median :77.80 | Median :66.00 | Median : 86.00 | Median : 90.40 | Median :36.30 | Median : 7.930 | Median :342.0 | Median :4.22 | Median : 5.40 | Median : 0.990 | Median : 4.100 | Median :0.02000 | Median :2.370 | Median :4.430 | Median : 6.540 | Median : 5.390 | Median : 4.800 | Median :10.80 | Median : 41.20 | Median : 16.50 | Median :14.20 | Median : 1.330 | Median :3.720 | Median : 25.00 | Median : 260.0 | Median :23.90 | Median :2.110 | Median : 274.0 | Median :30.90 | Median : 0.4300 | Median :12.50 | Median :32.40 | Median : 33.00 | Median : 1.100 | Median :0.02000 | Median :30.90 | Median : 38.90 | Median : 26.50 | Median :140.7 | Median :0.210 | Median : 26.00 | Median : 89.40 | Median : 74.0 | |
| Mean :124.4 | Mean :103.3 | Mean :16.05 | Mean : 1.098 | Mean :0.8994 | Mean : 85.68 | Mean :32.65 | Mean :0.2631 | Mean : 16.54 | Mean :192.9 | Mean : 6.519 | Mean : 6.771 | Mean :13.21 | Mean :75.47 | Mean :65.25 | Mean : 81.22 | Mean : 90.62 | Mean :36.58 | Mean : 18.974 | Mean :342.1 | Mean :4.30 | Mean : 9.85 | Mean : 1.163 | Mean : 8.326 | Mean :0.05369 | Mean :2.347 | Mean :4.501 | Mean : 8.525 | Mean : 8.017 | Mean : 9.788 | Mean :10.98 | Mean : 42.83 | Mean : 17.72 | Mean :16.84 | Mean : 6.262 | Mean :3.745 | Mean : 54.35 | Mean : 295.4 | Mean :23.20 | Mean :2.095 | Mean : 477.3 | Mean :32.24 | Mean : 0.5964 | Mean :13.24 | Mean :32.56 | Mean : 49.53 | Mean : 1.299 | Mean :0.02135 | Mean :31.01 | Mean : 41.27 | Mean : 64.98 | Mean :141.8 | Mean :0.213 | Mean : 42.76 | Mean : 81.95 | Mean : 117.9 | |
| 3rd Qu.:138.0 | 3rd Qu.:105.6 | 3rd Qu.:16.30 | 3rd Qu.: 0.450 | 3rd Qu.:1.5000 | 3rd Qu.: 98.00 | 3rd Qu.:37.60 | 3rd Qu.:0.4000 | 3rd Qu.: 16.15 | 3rd Qu.:257.0 | 3rd Qu.: 8.900 | 3rd Qu.: 7.900 | 3rd Qu.:13.80 | 3rd Qu.:91.95 | 3rd Qu.:70.40 | 3rd Qu.: 98.00 | 3rd Qu.: 94.25 | 3rd Qu.:40.15 | 3rd Qu.: 13.170 | 3rd Qu.:349.0 | 3rd Qu.:5.41 | 3rd Qu.:11.50 | 3rd Qu.: 1.540 | 3rd Qu.: 4.640 | 3rd Qu.:0.09000 | 3rd Qu.:2.450 | 3rd Qu.:4.820 | 3rd Qu.: 9.915 | 3rd Qu.:11.275 | 3rd Qu.: 7.450 | 3rd Qu.:11.60 | 3rd Qu.: 45.30 | 3rd Qu.: 17.90 | 3rd Qu.:27.50 | 3rd Qu.:12.050 | 3rd Qu.:4.370 | 3rd Qu.: 41.00 | 3rd Qu.: 347.9 | 3rd Qu.:26.35 | 3rd Qu.:2.220 | 3rd Qu.: 618.5 | 3rd Qu.:37.80 | 3rd Qu.: 0.6100 | 3rd Qu.:14.50 | 3rd Qu.:35.75 | 3rd Qu.: 55.00 | 3rd Qu.: 1.305 | 3rd Qu.:0.03000 | 3rd Qu.:32.20 | 3rd Qu.: 44.20 | 3rd Qu.: 99.12 | 3rd Qu.:143.3 | 3rd Qu.:0.280 | 3rd Qu.: 42.00 | 3rd Qu.:105.00 | 3rd Qu.: 97.0 | |
| Max. :178.0 | Max. :140.4 | Max. :92.10 | Max. :57.170 | Max. :8.6000 | Max. :620.00 | Max. :47.60 | Max. :1.7000 | Max. :295.40 | Max. :554.0 | Max. :53.000 | Max. :59.700 | Max. :27.10 | Max. :98.90 | Max. :83.40 | Max. :142.00 | Max. :117.60 | Max. :52.30 | Max. :1726.600 | Max. :488.0 | Max. :8.95 | Max. :68.40 | Max. :33.690 | Max. :749.500 | Max. :0.46000 | Max. :2.790 | Max. :9.860 | Max. :38.820 | Max. :32.220 | Max. :242.900 | Max. :15.00 | Max. :113.30 | Max. :144.90 | Max. :48.50 | Max. :21.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :33.80 | Max. :2.600 | Max. :1867.0 | Max. :62.20 | Max. :39.9200 | Max. :25.30 | Max. :49.20 | Max. :732.00 | Max. :11.570 | Max. :0.12000 | Max. :50.80 | Max. :106.40 | Max. :320.00 | Max. :179.5 | Max. :0.510 | Max. :1508.00 | Max. :206.90 | Max. :1497.0 | |
| NA’s :1 | NA’s :3 | NA’s :5 | NA’s :44 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :2 | NA’s :7 | NA’s :1 | NA’s :1 | NA’s :5 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :59 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :4 | NA’s :3 | NA’s :5 | NA’s :1 | NA’s :1 | NA’s :11 | NA’s :7 | NA’s :59 | NA’s :1 | NA’s :15 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :3 | NA’s :1 | NA’s :11 | NA’s :1 | NA’s :11 | NA’s :1 | NA’s :1 | NA’s :5 | NA’s :1 | NA’s :1 | NA’s :59 | NA’s :4 | NA’s :3 | NA’s :11 | NA’s :1 | NA’s :1 | NA’s :1 |
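The list of dropped columns was written out by hand above. As a hedged alternative, the same kind of selection could be derived from the NA counts. The sketch below uses a made-up data frame and a hypothetical cutoff `max_na`; on the real data a cutoff of 69 would be an assumption read off the summary table.

```r
# Toy data frame standing in for class_df (values are made up).
toy_df <- data.frame(
  mostly_missing = c(NA, NA, NA, 4),
  some_missing   = c(1, NA, 3, 4),
  complete       = c(1, 2, 3, 4)
)

# Hypothetical cutoff: drop columns with at least this many NAs.
max_na <- 2

# Count missing values per column.
na_per_col <- colSums(is.na(toy_df))

# Names of the columns to drop, then the reduced data frame.
cols_to_drop <- names(na_per_col)[na_per_col >= max_na]
toy_reduced <- toy_df[, !(names(toy_df) %in% cols_to_drop), drop = FALSE]
```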
The remaining missing values are replaced with the median of their column, because the distributions are very skewed. The summary of the clean dataset, with no missing values, is presented below.
class_df <- class_df %>% na_mean(option = "median")
class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
| hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Alkaline phosphatase | albumin | basophil(%) | Total bilirubin | Platelet count | monocytes(%) | indirect bilirubin | Red blood cell distribution width | neutrophils(%) | total protein | Prothrombin activity | mean corpuscular volume | hematocrit | White blood cell count | mean corpuscular hemoglobin concentration | fibrinogen | Urea | lymphocyte count | Red blood cell count | Eosinophil count | Corrected calcium | Serum potassium | glucose | neutrophils count | Direct bilirubin | Mean platelet volume | RBC distribution width SD | Thrombin time | (%)lymphocyte | D-D dimer | Total cholesterol | aspartate aminotransferase | Uric acid | HCO3- | calcium | Lactate dehydrogenase | platelet large cell ratio | monocytes count | PLT distribution width | globulin | Gamma glutamyl transpeptidase | International standard ratio | basophil count(#) | mean corpuscular hemoglobin | Activation of partial thromboplastin time | hs_CRP | serum sodium | thrombocytocrit | glutamic-pyruvic transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 6.4 | Min. : 77.7 | Min. :11.50 | Min. : 0.0200 | Min. :0.0000 | Min. : 17.00 | Min. :13.60 | Min. :0.0000 | Min. : 2.80 | Min. : -1.0 | Min. : 0.600 | Min. : 0.100 | Min. :10.6 | Min. : 1.90 | Min. :31.80 | Min. : 7.00 | Min. : 62.30 | Min. :15.60 | Min. : 0.710 | Min. :286.0 | Min. :0.500 | Min. : 1.700 | Min. : 0.050 | Min. : 0.100 | Min. :0.0000 | Min. :1.650 | Min. :2.760 | Min. : 1.000 | Min. : 0.320 | Min. : 1.600 | Min. : 8.50 | Min. : 31.3 | Min. : 13.00 | Min. : 0.300 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 52.0 | Min. : 6.30 | Min. :1.170 | Min. : 110.0 | Min. :11.2 | Min. : 0.0100 | Min. : 8.10 | Min. :10.10 | Min. : 7.00 | Min. : 0.840 | Min. :0.00000 | Min. :20.80 | Min. : 21.80 | Min. : 0.100 | Min. :121.1 | Min. :0.0100 | Min. : 5.00 | Min. : 2.00 | Min. : 14.0 | |
| 1st Qu.:112.0 | 1st Qu.: 99.6 | 1st Qu.:13.40 | 1st Qu.: 0.0400 | 1st Qu.:0.0000 | 1st Qu.: 54.00 | 1st Qu.:28.20 | 1st Qu.:0.1000 | 1st Qu.: 7.20 | 1st Qu.:113.0 | 1st Qu.: 2.975 | 1st Qu.: 3.775 | 1st Qu.:12.1 | 1st Qu.:61.88 | 1st Qu.:61.20 | 1st Qu.: 67.00 | 1st Qu.: 86.97 | 1st Qu.:33.00 | 1st Qu.: 5.117 | 1st Qu.:332.0 | 1st Qu.:3.417 | 1st Qu.: 3.800 | 1st Qu.: 0.520 | 1st Qu.: 3.550 | 1st Qu.:0.0000 | 1st Qu.:2.260 | 1st Qu.:4.037 | 1st Qu.: 5.143 | 1st Qu.: 3.115 | 1st Qu.: 3.100 | 1st Qu.:10.10 | 1st Qu.: 38.8 | 1st Qu.: 15.80 | 1st Qu.: 4.175 | 1st Qu.: 0.510 | 1st Qu.:2.950 | 1st Qu.: 19.00 | 1st Qu.: 198.8 | 1st Qu.:20.90 | 1st Qu.:1.990 | 1st Qu.: 199.0 | 1st Qu.:25.4 | 1st Qu.: 0.3100 | 1st Qu.:11.07 | 1st Qu.:28.98 | 1st Qu.: 21.00 | 1st Qu.: 1.020 | 1st Qu.:0.01000 | 1st Qu.:29.70 | 1st Qu.: 36.08 | 1st Qu.: 2.075 | 1st Qu.:138.3 | 1st Qu.:0.1400 | 1st Qu.: 17.00 | 1st Qu.: 66.78 | 1st Qu.: 58.0 | |
| Median :125.0 | Median :102.3 | Median :14.30 | Median : 0.0950 | Median :0.2000 | Median : 71.00 | Median :33.20 | Median :0.2000 | Median : 10.60 | Median :190.0 | Median : 6.200 | Median : 5.300 | Median :12.7 | Median :77.80 | Median :66.00 | Median : 86.00 | Median : 90.40 | Median :36.30 | Median : 7.930 | Median :342.0 | Median :4.220 | Median : 5.400 | Median : 0.990 | Median : 4.100 | Median :0.0200 | Median :2.370 | Median :4.430 | Median : 6.540 | Median : 5.390 | Median : 4.800 | Median :10.80 | Median : 41.2 | Median : 16.50 | Median :14.200 | Median : 1.330 | Median :3.720 | Median : 25.00 | Median : 260.0 | Median :23.90 | Median :2.110 | Median : 274.0 | Median :30.9 | Median : 0.4300 | Median :12.50 | Median :32.40 | Median : 33.00 | Median : 1.100 | Median :0.02000 | Median :30.90 | Median : 38.90 | Median : 26.500 | Median :140.7 | Median :0.2100 | Median : 26.00 | Median : 89.40 | Median : 74.0 | |
| Mean :124.4 | Mean :103.3 | Mean :16.02 | Mean : 0.9742 | Mean :0.8975 | Mean : 85.64 | Mean :32.66 | Mean :0.2629 | Mean : 16.52 | Mean :192.9 | Mean : 6.518 | Mean : 6.762 | Mean :13.2 | Mean :75.48 | Mean :65.25 | Mean : 81.28 | Mean : 90.62 | Mean :36.58 | Mean : 18.943 | Mean :342.1 | Mean :4.286 | Mean : 9.838 | Mean : 1.162 | Mean : 8.314 | Mean :0.0536 | Mean :2.347 | Mean :4.500 | Mean : 8.497 | Mean : 8.009 | Mean : 9.774 | Mean :10.97 | Mean : 42.8 | Mean : 17.52 | Mean :16.837 | Mean : 6.054 | Mean :3.745 | Mean : 54.27 | Mean : 295.3 | Mean :23.20 | Mean :2.095 | Mean : 476.7 | Mean :32.2 | Mean : 0.5959 | Mean :13.21 | Mean :32.56 | Mean : 49.49 | Mean : 1.296 | Mean :0.02135 | Mean :31.01 | Mean : 40.88 | Mean : 64.543 | Mean :141.8 | Mean :0.2129 | Mean : 42.72 | Mean : 81.97 | Mean : 117.7 | |
| 3rd Qu.:138.0 | 3rd Qu.:105.5 | 3rd Qu.:16.30 | 3rd Qu.: 0.3525 | 3rd Qu.:1.5000 | 3rd Qu.: 98.00 | 3rd Qu.:37.60 | 3rd Qu.:0.4000 | 3rd Qu.: 16.12 | 3rd Qu.:257.0 | 3rd Qu.: 8.900 | 3rd Qu.: 7.900 | 3rd Qu.:13.8 | 3rd Qu.:91.92 | 3rd Qu.:70.40 | 3rd Qu.: 97.25 | 3rd Qu.: 94.22 | 3rd Qu.:40.12 | 3rd Qu.: 13.155 | 3rd Qu.:349.0 | 3rd Qu.:5.145 | 3rd Qu.:11.500 | 3rd Qu.: 1.540 | 3rd Qu.: 4.635 | 3rd Qu.:0.0900 | 3rd Qu.:2.450 | 3rd Qu.:4.812 | 3rd Qu.: 9.675 | 3rd Qu.:11.242 | 3rd Qu.: 7.425 | 3rd Qu.:11.60 | 3rd Qu.: 45.2 | 3rd Qu.: 17.50 | 3rd Qu.:27.500 | 3rd Qu.:10.515 | 3rd Qu.:4.370 | 3rd Qu.: 41.00 | 3rd Qu.: 347.4 | 3rd Qu.:26.32 | 3rd Qu.:2.220 | 3rd Qu.: 617.8 | 3rd Qu.:37.6 | 3rd Qu.: 0.6100 | 3rd Qu.:14.40 | 3rd Qu.:35.73 | 3rd Qu.: 55.00 | 3rd Qu.: 1.300 | 3rd Qu.:0.03000 | 3rd Qu.:32.20 | 3rd Qu.: 42.83 | 3rd Qu.: 98.950 | 3rd Qu.:143.3 | 3rd Qu.:0.2700 | 3rd Qu.: 42.00 | 3rd Qu.:105.00 | 3rd Qu.: 97.0 | |
| Max. :178.0 | Max. :140.4 | Max. :92.10 | Max. :57.1700 | Max. :8.6000 | Max. :620.00 | Max. :47.60 | Max. :1.7000 | Max. :295.40 | Max. :554.0 | Max. :53.000 | Max. :59.700 | Max. :27.1 | Max. :98.90 | Max. :83.40 | Max. :142.00 | Max. :117.60 | Max. :52.30 | Max. :1726.600 | Max. :488.0 | Max. :8.950 | Max. :68.400 | Max. :33.690 | Max. :749.500 | Max. :0.4600 | Max. :2.790 | Max. :9.860 | Max. :38.820 | Max. :32.220 | Max. :242.900 | Max. :15.00 | Max. :113.3 | Max. :144.90 | Max. :48.500 | Max. :21.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :33.80 | Max. :2.600 | Max. :1867.0 | Max. :62.2 | Max. :39.9200 | Max. :25.30 | Max. :49.20 | Max. :732.00 | Max. :11.570 | Max. :0.12000 | Max. :50.80 | Max. :106.40 | Max. :320.000 | Max. :179.5 | Max. :0.5100 | Max. :1508.00 | Max. :206.90 | Max. :1497.0 |
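For reference, median imputation can be sketched in base R as follows. This is an illustration on made-up data, not the implementation behind `na_mean` (assumed here to come from the imputeTS package); `impute_median` is a hypothetical helper.

```r
# Toy data with one skewed column (values are made up).
toy_df <- data.frame(
  x = c(1, 2, NA, 100),
  y = c(NA, 10, 20, 30)
)

# Replace the NAs of a numeric vector with the median of its observed values.
impute_median <- function(column) {
  column[is.na(column)] <- median(column, na.rm = TRUE)
  column
}

# Apply the helper to every column.
toy_imputed <- as.data.frame(lapply(toy_df, impute_median))
```

In column `x` the median of the observed values is 2, while the mean is pulled above 34 by the single large value, which is why the median is the safer fill for skewed biomarkers.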
The preprocessed data is split into two datasets: training and testing.
The patients are grouped by outcome: those who survived come first in the dataset, followed by those who died. The dataset is therefore shuffled below, and the training and testing sets are checked to confirm that they have a similar outcome class distribution.
To ensure the repeatability of the experiments, the seed is set to 23.
set.seed(23)
rows <- sample(nrow(class_df))
class_df <- class_df[rows,]
set.seed(23)
inTraining <- createDataPartition(y = class_df$outcome, p=.70, list=FALSE)
training <- class_df[inTraining,]
testing <- class_df[-inTraining,]
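The idea behind a stratified split can be sketched in base R: sample 70% of the rows within each outcome class separately, so both sets keep roughly the same class proportions. This is an approximation of the idea on a made-up outcome vector, not caret's exact `createDataPartition` implementation.

```r
set.seed(23)

# Made-up outcome vector: 40 "Died" and 60 "Survived".
toy_outcome <- factor(rep(c("Died", "Survived"), times = c(40, 60)))

# Sample 70% of the row indices within each class separately.
rows_by_class <- split(seq_along(toy_outcome), toy_outcome)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_outcome <- toy_outcome[train_idx]
test_outcome <- toy_outcome[-train_idx]

# Class proportions should match in both sets.
prop.table(table(train_outcome))
prop.table(table(test_outcome))
```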
Training set summary
training %>% select(gender, outcome) %>% tbl_summary(by = outcome) %>% as_kable() %>% kable_paper("hover")
| Characteristic | Died, N = 115 | Survived, N = 136 |
|---|---|---|
| gender | ||
| female | 36 (31%) | 77 (57%) |
| male | 79 (69%) | 59 (43%) |
Testing set summary
testing %>% select(gender, outcome) %>% tbl_summary(by = outcome) %>% as_kable() %>% kable_paper("hover")
| Characteristic | Died, N = 48 | Survived, N = 57 |
|---|---|---|
| gender | ||
| female | 9 (19%) | 25 (44%) |
| male | 39 (81%) | 32 (56%) |
For the learning process, repeated 2-fold cross-validation is used, with the procedure repeated 5 times.
set.seed(23)
ctrl <- trainControl(
method = "repeatedcv",
number = 2,
repeats = 5,
classProbs = TRUE)
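To make the resampling scheme concrete, the sketch below generates fold labels for 5 repeats of 2-fold cross-validation on a made-up sample of 10 rows. It is a simplified illustration: the folds here are purely random, whereas caret also balances the folds by outcome class.

```r
set.seed(23)

n_samples <- 10  # hypothetical toy size
n_folds <- 2
n_repeats <- 5

# For each repeat, shuffle the fold labels to get a fresh random partition.
fold_assignments <- lapply(seq_len(n_repeats), function(r) {
  sample(rep(seq_len(n_folds), length.out = n_samples))
})

# Each repeat gives n_folds train/validation splits,
# so every candidate parameter value is fitted this many times.
n_fits <- n_repeats * n_folds
```

With 2 folds, each model is trained on about half of the training set and validated on the other half.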
To measure the performance of the model, three measures are considered: accuracy, the ROC curve, and AUC.
The Random Forest model is trained with the number of trees in the forest set to 15 (`ntree = 15`) and with `mtry` tuned over the values 10 to 20. No `metric` argument is passed to `train`, so accuracy is the metric used for tuning, as the output confirms.
rfGrid <- expand.grid(mtry = 10:20)
set.seed(23)
rf_fit <- train(outcome ~ .,
data = training,
method = "rf",
preProc = c("center", "scale"),
trControl = ctrl,
tuneGrid = rfGrid,
ntree = 15)
rf_fit
## Random Forest
##
## 251 samples
## 58 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (58), scaled (58)
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 126, 125, 125, 126, 125, 126, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 10 0.9553587 0.9101052
## 11 0.9593079 0.9182601
## 12 0.9593460 0.9180843
## 13 0.9577079 0.9149460
## 14 0.9593143 0.9180618
## 15 0.9601206 0.9198597
## 16 0.9584952 0.9168637
## 17 0.9601206 0.9198335
## 18 0.9625206 0.9245983
## 19 0.9633270 0.9262290
## 20 0.9649016 0.9294723
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
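The model works very well, with 97% accuracy on the testing set. Three patients are misclassified: all three survived but were predicted to die. Since the positive class is "Died", these are false positives (FP), which is the less harmful type of error here. There are no false negatives.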
rf_classes <- predict(rf_fit, newdata = testing)
rf_classes_prob <- predict(rf_fit, newdata = testing, type = "prob")
caret::confusionMatrix(data = rf_classes, testing$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 48 3
## Survived 0 54
##
## Accuracy : 0.9714
## 95% CI : (0.9188, 0.9941)
## No Information Rate : 0.5429
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9427
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9474
## Pos Pred Value : 0.9412
## Neg Pred Value : 1.0000
## Prevalence : 0.4571
## Detection Rate : 0.4571
## Detection Prevalence : 0.4857
## Balanced Accuracy : 0.9737
##
## 'Positive' Class : Died
##
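The headline statistics can be recomputed by hand from the four cells of the confusion matrix. The sketch below does so in base R, with "Died" as the positive class as in the output above. The variable names are illustrative.

```r
# Confusion matrix from the output above (rows: prediction, columns: reference).
cm <- matrix(c(48, 0, 3, 54), nrow = 2,
             dimnames = list(Prediction = c("Died", "Survived"),
                             Reference = c("Died", "Survived")))

tp <- cm["Died", "Died"]          # died, predicted died
fp <- cm["Died", "Survived"]      # survived, predicted died
fn <- cm["Survived", "Died"]      # died, predicted survived
tn <- cm["Survived", "Survived"]  # survived, predicted survived

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```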
The ROC curve runs very close to the top-left corner, which means that the model separates the two classes very well. The AUC value is also very high, close to 1.
rf_ROC <- roc(response = testing$outcome,
predictor = rf_classes_prob[, "Died"],
levels = rev(levels(testing$outcome)),
plot = TRUE,
auc = TRUE,
print.auc = TRUE)
rf_ROC
##
## Call:
## roc.default(response = testing$outcome, predictor = rf_classes_prob[, "Died"], levels = rev(levels(testing$outcome)), auc = TRUE, plot = TRUE, print.auc = TRUE)
##
## Data: rf_classes_prob[, "Died"] in 57 controls (testing$outcome Survived) < 48 cases (testing$outcome Died).
## Area under the curve: 0.9987
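AUC can be read as the probability that a randomly chosen patient who died receives a higher predicted "Died" probability than a randomly chosen survivor. The sketch below computes it from ranks (the Mann-Whitney U statistic) on made-up scores. It illustrates the definition and is not pROC's implementation.

```r
# Made-up predicted probabilities of "Died" and the true outcomes.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c("Died", "Died", "Survived", "Died",
            "Survived", "Died", "Survived", "Survived")

is_case <- labels == "Died"
n_cases <- sum(is_case)
n_controls <- sum(!is_case)

# Rank all scores (ties receive average ranks), then apply the U statistic.
score_ranks <- rank(scores)
u_stat <- sum(score_ranks[is_case]) - n_cases * (n_cases + 1) / 2

# Fraction of (case, control) pairs in which the case scores higher.
auc <- u_stat / (n_cases * n_controls)
```

In this toy example 13 of the 16 case-control pairs are ordered correctly, so `auc` is 0.8125.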
The variable importances presented below show that the model relies on just a few decisive variables. Three variables, LDH, lymphocyte, and hs-CRP, are marked as the most important for predicting the mortality of COVID-19 patients, the same as in the article An interpretable mortality prediction model for COVID-19 patients.
importance <- varImp(rf_fit)
importance
## rf variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## `Lactate dehydrogenase` 100.000
## `(%)lymphocyte` 33.492
## hs_CRP 27.732
## `lymphocyte count` 15.807
## `neutrophils(%)` 13.289
## `Prothrombin time` 12.755
## `International standard ratio` 10.094
## `Platelet count` 4.623
## `aspartate aminotransferase` 3.872
## procalcitonin 3.417
## `HCO3-` 2.343
## `Total cholesterol` 2.080
## `platelet large cell ratio` 2.016
## thrombocytocrit 2.014
## age 2.013
## `glutamic-pyruvic transaminase` 1.648
## `Activation of partial thromboplastin time` 1.637
## eGFR 1.266
## glucose 1.240
## `eosinophils(%)` 1.212
As further work, the model could be retrained using only the most important variables (importance > 5), with the aim of reaching 100% classification accuracy. Additional model evaluation could be done with learning curves to detect overfitting.
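As a starting point for that follow-up, the selection step could look like the sketch below. The importance values are copied from the output above (only the first nine are included), and `importance_scores`, `selected_vars`, and `reduced_formula` are illustrative names. Retraining on the reduced formula has not been done here, so no accuracy is claimed for it.

```r
# Importance values copied from the varImp output (first nine variables).
importance_scores <- c(
  "Lactate dehydrogenase"        = 100.000,
  "(%)lymphocyte"                = 33.492,
  "hs_CRP"                       = 27.732,
  "lymphocyte count"             = 15.807,
  "neutrophils(%)"               = 13.289,
  "Prothrombin time"             = 12.755,
  "International standard ratio" = 10.094,
  "Platelet count"               = 4.623,
  "aspartate aminotransferase"   = 3.872
)

# Keep the variables whose importance exceeds 5.
selected_vars <- names(importance_scores)[importance_scores > 5]

# Build a model formula; backticks protect names with spaces or symbols.
reduced_formula <- reformulate(paste0("`", selected_vars, "`"),
                               response = "outcome")
reduced_formula
```

Seven variables pass the cutoff. The resulting formula could be passed to `train` in place of `outcome ~ .`.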